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Abstract 


Angus,  J.E^  Computer-assisted  improwement  of  mean  squared  error  in  statisticai  estimation.  Mathematics  and 
Computers  in  Simulation  3S  (1993)  1-13. 


A  computer-assisted  method  for  improving  the  mean  squared  error  (MSE)  in  estimation  for  parametric 
models  is  presented.  Assuming  existence  of  nontrivial  sufficient  sutistks,  the  method  involves  generation  of 
Monte  Carlo  samplet  ffom  the  conditional  distributioo  of  the  ohaeivables,  given  the  sufficient  statistic(s).  The 
method  is  IDastiited  ^-connection  with  a  shnpie  badc-propngition  neural  network  model  for  estimating  a 
logistic  regreaion  function,  and  a  specific  mimerical  example  related  to  logistic  regression  is  presented. 


1.  latradMtkM  and  basic  BMthod 

Suppose  that  AT  is  a  randtnn  d-vector  having  density  belonging  to  the  dominated  family 
(/(*;d)>de9)  with  dominating  <r-finite  measure  ii.  Suppose  fc'  the  moment  that  0  is 
i^-valued,  and  that  it  is  desired  to  estimate  9  using  squared  error  loss.  Typically,  many 
estimaton  are  available,  and  sometimes  an  estimator  that  is  optimum  in  some  sense  (e.g.,  a 
uniform  minimum  variance  unbiased  estimator)  can  be  shown  to  exist  in  theory.  Often,  the 
statistician  is  forced  to  use  a  suboptimum  estimator.  For  example,  this  can  happen  if  the 
-optimum  estimator  is  analytically  intrartrtte,  or  tfocowemic  eonsideratioiis  dictate  that  pre-cx- 
isting  algorithms,  created  without  regard  to  optimality,  must  be  used  without  modification. 
Examples  of  the  fmmer  abound,  while  the  latter  situation  exists,  for  example,  in  the  case  where'* 
a  muilinear  regression  function  is  estimated  using  a  back-propagation  neural  network.  White 
[13]  describes  sudi  neural  netwwks,  and  shows  that  the  badt-propagation  algorithm  leads  to 
estimators  of  the  network  weights  that  are  less  efficient  (i.e.,  have  greater  asymptotic  variances) 
than  ordinaiy  nonlinear  least  squares  estimators.  See  also  [7,8,12]  for  des^ptions  of  the 
back-propagation  algorithm,  which  is  a  version  of  stodiastic  gradient  descent.  Angus  [1] 
discusses  connections  between  back-propagation  neural  networks  and  statistical  nonlinear  least 
squares.  White’s  [13]  landmark  paper  connecting  neural  networics  with  concepts  in  asymptotic 
statistical  estimation  presents  a  method  for  ’’correcting”  the  back-propagation  algorithm  to 
make  its  asymptotic  efficiency  equivalent  to  that  of  nonlinear  least  squares.  The  correction  is 
analogous  to  the  method  of  scoring  in  efficient  likelihood  estimation  as  presented  in  [11],  for 
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example,  and  amounts  to  taking  one  nonlinear  least  squaresTlewton-Raphson  step  from  the 
back'propagation  solution. 

A  statistic  T ■  T{X)  is  by  definition  sufficient  for  the  parameter  9  if  the  conditional 
distribution  of  X  given  T  does  not  depend  on  9.  The  usual  method  for  finding  sufficient 
statistics  is  the  application  of  the  factorization  criterion,  which  states  that  under  fairiy  general 
conditions  the  statistic  T  is  sufficient  for  0  if  the  density  function  of  X  factors  as  /(x;  9)  > 
h{x)g{T(x)\  0),  where  the  function  h  does  not  depend  on  0,  and  the  ^notion  g  depends  on  x 
only  through  Tix).  When  a  sufficient  statistic  T  is  available,  and  0  -  9(X)  is  an  estimator  of  0 
having  finite  second  moment,  then  the  Rao-Blackwell  theorem  implies  that  the  new  estimator 
defined  by  E^S(.T)^Ei9\T4t  the  condition^,  expectation -of  0  ^ven  T,  has  smaller  mean 
squared  error  than  0.  That  is, 

MSE(« )  ■  £(5  -  0)U  MSE(0)  •  £(0  -  0)*. 

See  [6]  for  further  details  on  sufficiency,  the  factorization  criterion  and  the  Rao-Blackwell 
theorem.  — 

It  happens  often  that  the  improved  estimator  8  cannot  be  used.  For  example,  the  conditional 
expectation  £(0 1 T)  may  be  difficult  to  compute  an^ically  and/or  economic  constraints  may 
dictate  that  only  the  algorithms  for  computing  0  can  be  used.  For  example,  a  “canned” 
algorithm  package  that  cannot  be  modified  m^  be  the  only  resource  available.  In  the  case  of  a 
neural  network  an  advantage  of  the  back>pr(^>agation  learning  algorithm,  a  relatively  (statist!* 
cally)  inefficient  parameter  estimation  alg^thm,  is  that  it  can  be  implemented  in  hardware  or 
firmware  by  essentially  running  the  networit  architecture  in  reverse.  However,  the  user  of  such 
a  network  would  not  be  free  to  modify  the  estimation  algorithm  to  achieve  greater  effidency. 

If  it  is  relatively  easy  to  generate  (i.e.,  simulate)  independent  and  identically  distributed 
observations  from  the  conditional  distribution  of  X  given  T,  then  the  following  aJgorithm  can 
be  used  to  approximate  S^T)  after  taking  the  observation  of  X.  Here,  0  can  now  be  a 
vector*valued  parameter,  and  the  abbreviation  “iid"  stands  for  “independent  and  identically 
distributed”. 

(i)  U  X"*x  is  observed,  calculate  the  (4>seived  value  t  of  T  by  f  T(.x). 

(U)  Generate  X*j...tX*yvd  according  to^the  conditioiial^itsriibution  of  ilT  given  T*  r. 
(iii)  Compute  (1) 


8*^8*ixr,...,x:,t)~-t9{xn  --  — 

«  iml 

as  the  approximation  to  0(r). 

By  Kolmogorov’s  strong  law  of  large  numbers  [10],  8*  converges  with  probability  1  (with 
respect  to  the  p^ability  space  on  which  the  A)*’s  are  defined)  to  the  mean  of  the  conditional 
distribution  of  9{X)  given  T-r,  i.e.,  to  5(r)-£(0(A’)|r*r).  Also,  it^will  be  shown  that  8* 
achieves  a  reduction  in  mean  squared  error  over  the  original  estimator  0,  although  not  as  great 
a  reduction  as  that  achieved  by  8.  Moreover,^  an  advantage  of  algorithm  (1)  is  that  the 
algorithms  already  existing  for  computation  of  9(X)  can  be  reused  with  the  simulation  data 
without  modification.  That  is,  no  (substantial)  new  algorithms  are  needed.  For  example,  in  the 
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case  of  the  back-propagation  neural  network,  the  initial  observation  X  (called  an  “examplar”) 
is  a  random  vector  containing  an  observation  of  a  set  of  inputs  along  with  the  corresponding 
output.  To  apply  algorithm  (1),  n  simulated  exemplars  X X*  would  be  generated 
independently  and  IdentinOly^istributed  from  the  conditional  distribution  of  X  given  a 
sufficient  statistic  T.  The  network  would  then  be  retrained  n  times  (once  for  each  simulated 
examplar),  and  the  resulting  weights  from  each  of  the  n  training  sessions  would  be  averaged  to 
form  the  new  estimated  weights  having  improved  mean  squared  error.  Of  course,  for  this 
approach  to  be  successful,  the  form  of  the  distribution  of  the  exemplar  vector  X  must  be 
known  and  amenable  to  attraction  of  a  ntmtrivial  sufficient  statistic. 

By  enlarging  the  original  probability  space,  it  may  be  assumed  that  all  the  above  random 
variables  are  defined  on  the  same  probability  space.  Suppose  for  the  moment  that  6  is 
real-valued.  If  5  is  a  random  variable  with  £(5^)  <  x,  the  conditional  variance  of  5  given  T  is 
defined  by  VaifSID  — The  relative  merits  of  the  three  estimators  S,  S 
and  d*  in  terms  of  mean  squared  errors  are  summarized  as  follows  (see  the  next  section  for  a 
derivation  of  these):  .. 

MSE(«)  -  bias2(P)  -»■  Var(« )  +  £(Var(P  I T)),  (2a) 

MSE(fi)-bias^(«)  +  Var(a),  (2b) 

MSE(«  •)  -  bias2(fl)  +  Var(5)  +  i£(Var(«  1 7)),  (2c) 

where  hnsiS)  •  E{$)  -  6,  Thus,  in  theory,  thw  computer-assisted  estimator  can  be  con¬ 
structed  to  have  mean  squared  error  ^arbitrarily  close  to  that  of  the  improved  estimator  8  by 
simply  increasing  n.  In  particular,  if  P  is  unbiased  for  0  (i.e.,  E{0) «  $  for  all  8e0)  and  8  is 
the  uniform  minimum  variance  unbiased  estimator  of  0,  then  is  also  unbiased,  the  mean 
squared  errors  in  (2)  become  variances,  and  5*  can  be  made  to  have  variance  arbitrarily  close 
to  the  optimum. 


2.  Mathcnutkal  background  and  notation 

Following  are  mathematical  and  statistical  notations  and  concepts  that  are  used  in  the 
remainder  of  the  paper.  It  is  assumed  that  the  reader  is  familiar  with  the  rudiments  of 
measure-theoretic  probability,  further  details  of  which  may  be  found  in  [4,10,11]. 

Vectors  will  be  taken  to  be  column  vectors,  and  a  superscript  “t”  will  denote  matrix 
transpose.  R*'  is  the  d-dimensional  Euclidean  space  with  the  usual  norm  II  *  II;  91*'  represents 
the  dass  of  Borel  subsets  of  R*'. 

Random  d-vectors  X  are  measurable  R^-valued  functions  defined  on  a  common  probability 
space  with  sample  space  D,  cr-field  of  events  3,  probability  measure  P  and  expectation 
operator  £.  If  d  1,  is  referred  to  as  a  random  variable.  Njim,  A)  will  signify  (depending 
on  context)  either  the  d-dimensional  normal  or  Gaussian  distribution  with  mean  vector  m  and 
variance-covariance  matrix  A,  or  a  random  vector  having  this  distribution.  If  A’  is  a  random 
d-vector,  Var(  A)  denotes  its  variance  covariance  matrix  £((  A  -  £AX  A  -  EXY). 

A  sequence  of  random  d-vectors  [X^l  n  >  1}  converges  in  distribution  to  the  random 
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^/•vector  X  and  «  -» oo,  written  as  n  -* ao,  if  p{X^  eA)-* P[XgA]  as  n  -* *  for  all 

Borel  sets  ^  with  boundary  dA  satisfying  P{X  €  dA)  •  0.  The  term  ’‘almost  surely"  (a.s.) 
is  synonymous  with  “on  a  set  of  probability  1”.  Thus,  for  example,  X^  -*  **-X  as  n  -» x  means 
that  for  all  w  in  a  set  having  probability  1,  XJiu)  -•  X{u>)  as  n  -•  «.  The  indicator  function  of  a 
set  A,  denoted  by  satisfies  I/t»)  -  1  if  w  £^4,  and  is  0  otherwise. 

If  X  and  Y  are  random  variables  with  £  |  T I  <  »,  then  (one  version  of)  the  conditional 
expectation  of  Y  given  X,  denoted  by  E(Y  |  X),  is  a  measurable  function  of  X,  g(X),  having 
the  property  that  f^g(X)  d£-  f^Y  dP  for  any  event  A  in  the  a-field  generated  by  X.  With 
this  definition  of  £(T|Af),  the  convention  is  made  that  £(y|  wglx)  for  all  x  in  the 
range  of  X.  If  X  and  Y  have  joint  density  with  respect  to  Lebesgue  measure,  and  X  has 
marginal  density  fx  with  respect  to  Lebesgue  measure,  then  g(x)mE{Y\X»x)  may  be 
computed  using  the  formula 

E(Y\X-x).} 

Conditional  expectations  have  the  following  properties: 

£(£(y|jf))-£(y),  if£|y|<«, 

E{h{X)Y  I X) «  h(X)E{Y  |  .Y)  a.s.,  if  h  is  a  measurable  function  of  Af. 

£|h(A’)y)j  <•,  £|y|  <», 

£(y,  +  y,|jf)-£(y, IJir)+£(yjlAf)a.s„  if£|y,l<«,  £|yji<». 

From  these  properties,  the  relations  (2)  in  Section  1  can  be  verified  assuming  that  E(i^)  <  ». 
First,  EiS)  -  £(£(d  1 7))  -  Eih  and  E(S*  1 7)  -  6(7),  almost  surely.  Hence, 

MSE(6) - £(6 -EiS^ESA  ES- - Var(6)  +  bias*(6), 

MSE(6)  -  £(«  -  6)^  -  £(6  -  6  +  6  -  6)* 

-  MSE(6)  +  £(£((6  -  sf  1 7))  +  2£((6  -  0)E{S  -  6 1 7)) 
-MSE(6)+£{Var(6|7)), 

MSE(6*)  -  £(6*  -  ef  -  £(6*  -6  +  6-6)* 

-  MSE(6)  +  e(e({8*  -  8f  1 7))  +  2£((6  -  6)£(6*  -  6 1 7)) 

-MSE(6)  +  i£(Var(6|7)). 


3.  A  central  Unit  theorem 

In  applying  the  algorithm  (1),  it  is  of  interest  to  know  whether  the  conditional  distribution  of 
}fn(8*  -  8)  given  7  approximates,  in  some  sense,  the  unconditional  distribution  of  Vn (6  - 6) 
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when  •  random  sample  X . X,  is  available.  The  former  is  the  disfnbution  that  would  result 

firom  repeated  independent  applications  of  algorithm  (1)  wing  the  same  fixed  value  of  /  on 
each  application,  and  it  depend  on  the  initial  estinutor  S  as  well  as  the  choice  of  sufficient 
statistic  T.  This  conditional  distribution  is  a  type  of  “bootstrap”  distribution  (see  [S]),  because  it 
involve  Monte  Carlo  resampling  from  a  conditional  distribution  that  depends  on  the  outcome 
of  the  original  sampie.lThis  question  of  approximation  is  difficult  to  answer  in  general,  but  can 
at  least  be  addressed  asymptotically  for  important  special  cases. 

To  be  definite,  assume  that  X^,  X2,>..,X„  are  random  </*vectors  that  constitute  a  random 
sample  from  a  d-parameter  regular  eiq>onential  family  with  density  fix;  r}) «  cxpir^'x  -ciri)) 
with  respect  to  a  a>finite  measure  m  on  (R^  91^).  Here,  rj  e  T  c  R*'  where  T  contains  an  open 
rectangle  in  R'',  and  the  function  c  is  twice  differentiable  and  satisfies 


E{X,) 


ac(i7) 

dij  ’ 


WuiX^)^E{X,Xl)-iEX,)iEX,y 


9^c(v) 
dff  drf'  * 


Let  0  ■  iciri)/dri  «  EiXi)  be  the  paranwter  of  interest.  It  can  be  shown  that  VaK  AT,)  depends 
on  17  only  through  6,  and  that  the  Fisher  information  matrix  7(0)  for  0  satisfies  /~'(0)- 
Vai<.Y,)  (see  [9,  p.  127],  for  example).  Suppose  that  it  is  of  interest  to  estimate  O^EiX,) 
efficiently  and  in  unbiased  fashion.  For  this  problem,  it  is  easy  to  show  (see  [9,  Example  5.3,  p. 

.438,  and  Section  6J])  that  the  most  efficient  unMased  estimauw  of  0  is  given  ^  0(3f„. . . ,  X„) 
-  (l/n)Li.iXf  •  5,/n.  Suppose  that  the  initial  estimator  of  0  is  taken  to  be  0  -  A',.  It  follows 
from  the  multivariate  central  limit  theorem  that  i^(0~0)»A^/O, /~'(0))  as  If 

algorithm  (1)  is  applied  in  this  cmitext  with  7-5„,  it  would  be  desirable  to  have,  with 
probability  1,  the  conditional  distribution  of  i/n  (S*  "  0)  given  T  also  converge  to  A^/0,  /~’(0)) 
in  some  sense.  The  following  theorem,  which  applies  to  more  general  situations,  helps  address 
this  issue.  The  result  of  t^  theorem  is  very  similar  to  the  fundamental  Bootstrap  Central  Limit 
Theorem  (see  [2],  for  example).  Before  stating  and  proving  Theorem  2,  the  following  lemma  is 
needed.  Its  proof  is  essentially  the  same  as  that  of  the  result  of  [11,  p.  147,  Problem  4.7]. 


Lamnw  1.  For  each  n  >  1,  frr  A,,,  A„2,...,  A,„  be  indgsendent  R^-valued  random  vectors  with 
£A,,-0.  £A.*A;,-ri,„  and  let  be  the  probabidty  distrdtution  of  A„,.  Suppose  that  as 

n  -»oo, 

«• 

(0 

n  j.i 

(ii)  for  every  «  >  0, 


4,L 


jrl>«v!r) 


I X II  Viufdjf}  -*  0, 


as  n~*». 


Then  iX^y  +  •  •  •  +A„,)/ Vn  N/Q,  A)  asn  -►». 


Theorem  2  can  now  be  stated  and  proved. 
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TlMoran  2.  Assume  that  Jf,.  X2,...  areiid  random  d-vectors  defined  on  a  fixed  probability  space 
{/},  3,  P),  each  havit^  covariance  matrix  A.  Let  +  ••  •  +X„,  and  for  each  n,  let 

’ )  be  a  regular  conditional  probability  charibution  of  Xi  j^ven  5,.  Consumed  on  S„,  let 
^  ^  according  to  • ).  Thm,  with  probabdity  1,  as  «  -* «, 


/  V*  j.  c  \ 

^  A)eA], 

for  every  Borel  setAefh^  with  boundary  dA  satisfying  A)  e  dA]  -  0. 


Proot  Qearly,  E(X^*  1 S,)  -  £(Af,  I  S„)  -  S„/n  almost  surely.  The  idea  of  the  proof  is  to  apply 
Lemma  1  to  the  sequence  {X*-S„/n;  conditioned  on  S„,  and  find  a  set  of 

probability  1  where  the  conditions  of  tite  lemma  apply.  Now  let 

n 


Then  using  obvious  synunetries, 

n  ,-i  «  V 


By  the  strong  law  of  large  numbers,  5„/n  '■*  *^£2r,  as  n  -» •.  Let  o;  be  the  <r>field  generated 
by  (S,,  Then  |S,)-£(A’,2f,*  lo;).  But  o„  decreases  to  n,>io;,  the  tail 

<r>field  of  the  sequence  (5.,  n  >  1),  and  it  follows  [4,  p.  228]  that 

'  ii»l  ' 

as  n  -» 00.  But  is  contained  in  the  (T'field  of  permutable  events  [10,  pp.  373,  374]  and 

hence  contains  only  events  of  probability  0  or  1  by  the  Hewitt-Savage  0>1  law,  so  that  the  latter 
conditional  expectation  is  constant  almmt  surely  [10,  p.  374].  But  since 


£(£(jr,^;|  n «'.))  -£(•*■.^1). 

that  constant  must  be  £(.Y,^,').  It  follows  that 

I  £E(n,nils.) 

"  /-I 


as  n  -» 00.  This  estatdbhes  (i)  in  Lemma  1. 
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Note  that  it  is  sufficient  (and  necessary)  to  verily  condition  (ii)  of  Lemma  1  only  for  rational 
«  >  0.  Atain,  employint  obvious  symmetries,  and  letting  C  >  0  be  a  constant, 

"  i-l 

-  £(  II X,  -  £( Jr,  I SJ II "/( II  Jf,  -  £(,^f,  I SJ II  >  I  S„) 

<  (c  +1  ^1(1  2»e I j£( II X,  -E{X,\  SJ 11  >  1 S,) 

+  II £{ II JT,  II '  >  Cl S,}  +  £( II X^  II  V{ II  .If ,  II ' >  C} I S„ ) 

+  2|~|£(  II X^  fl /{ II .If,  II ' > C} |5„). 

with  all  inequalities  being  understood  as  “almost  sure*’  inequalities.  By  Chebyshev’s  inequality, 

-£(jf,  is,)ll  >«V»  |i.}<(rfr<)‘'j£(llJr,  II  |5,)+|^|J, 

which  converges  almost  surely  to  zero  by  using  arguments  similar  to  those  used  in  verifying  (i) 
of  Lemma  1,  and  the  strong  law  of  large  numbers.  Using  Chebyshev’s  inequality  again  as  well  as 
the  same  arguments  just  mentioned,  it  follows  that  almost  surely 

ii'Scis,>+£(ii  Jr,  bV{b  Jf,  ll’>c)is.) 

+2j^£(BJr,ll/{lljr,B*>C)l5.) 

ic  a*  ■  ‘  —  . 

-jij  E(\\X,\\^\S^)+E{\\X^V^[\iX^\\^>C}\S,) 

+  2j^|£(||Jf,lf/{||,y,l!*>C)|S,). 
which  converges  almost  surely  as  n  <-»  oe  to 

C- '  II  EXi  II  ^£(  II  X^  II  +  £(  II  2f,  II  */{  II  2f,  II  *  >  C}) 

+  2H£^,||£(||Jf,||/{||.lf,||">C}). 

Since  C  was  arbitrary,  letting  C  and  applying  the  Dominated  Convergence  Theorem  {4]  to 
the  last  two  terms,  die  entire  expression  tends  to  0  and  it  follows  that  condition  (ii)  of  Lemma  1 
is  satisfied  almost  surely  for  all  rational  o  0.  The  number  of  exceptional  w  sets  where  the 
aforementioned  calculations  and  inequalities  fail,  as  well  as  those  where  the  /i  >  1,  fail  to 
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be  probability  measures,  is  at  most  countable,  and  each  has  probability  0.  Hence,  the 
conclusion  of  the  theorem  follows.  □ 

It  foliowajirom  Theorem  2  that  if  algorithm  (1)  is  applied  in  the  context  of  the  discussion  at 
the  beginning  oflhis  section,  then  along  almost  all  sample  sequences -A’,, -the 
conditional  distribution  of  ^(6*  -5)  given  5,  converges  in  distribution  to  A/j(0,  I~\e))  as 
n  -*  90,  which  is  the  same  limiting  distribution  as  that  of  -  9). 


4.  AppUcatloo  to  a  back*propagation  neural  network  Iraplementatloa  of  logistic  regression 


To  illustrate  the  method  of  applying  algorithm  (1),  consider  the  logistic  regression  model.  In 
this  model,  a  dichotomous  random  variable  Y  has  the  conditional  probability  mass  function 
P{Y - 1 1  AT}  -  1  -  P{Y •  0 1 2r)  -  F(X*0)  where  F  is  the  cumulative  distribution  function  of  the 
logistic  distribution,  namely  Fix)  ■■  e'/(l  +  e').  Here,  X  is  a  random  p>vector  of  explanatory 
variables,  and  9  is  a  p-vector  of  unknown  parameters.  This  model  arises  in  bioassay,  medical 
diagnosis,  linear  discriminant  analysis  and  many  other  statistical  contexts.  Suppose  that  random 
samples  Y^  ^, ....  ftom  the  conditional  distributions  of  T I  i  -  1, . . . ,  IT,  are  available  and 
it  is  desir^  to  estimate  9.  Several  statistical  techniques  are  available  for  estimating  9,  including 
maximum  likelihood,  and  minimum  logit  chi*squ8re  (see  [3]  and  Section  5).  In  fact,  Berkson  [3] 
studies  the  problem  of  improving  the  mean  squared  error  of  the  minimum  logit  chi-square 
estimator  of  9  through  the  use  of  the  Rao-Blackweil  theorem.  The  initial  estimation  problem, 
however,  also  fits  naturally  into  a  simple  two-layer  back-propagation  neural  network  with 
logistic  sigmoid  response  function  (see  Rg.  1). 

K  exemplars  consisting  of  the  pairs  iX^t  Pi),...,iXK,  p^),  would  be  presented  to  the 
networit  of  Fig.  1,  where  ^?“(l/m,)E7-  ^Y,J.  The  network,  using  the  back-propagation  algo¬ 
rithm,  would  then  “learn"  the  connection  weights  (the  9,’s)  that  provide  a  minimum  of  the 
quantity  Zf.ii0i'-FiX,^$))\  (A  modified  value  of  p,  that  lies  strictly  between  0  and  1  may  be 
necessary  to  avoid  numerical  instabilities,  see  Section  S.)  The  mean  squared  errors  of  the 
estimators  of  thej9j’s  thus  obtained  can  be  improved  using  the  method  of  Section  1  as  follows. 
It  is  assumed  that  the  anal:^  are  all  conditKmaTm  the  values  ofThe  X'j^:That1^^the  2f/s«re 


treated  as  constants. 

It  follows  directly  fiom  the  factorization  criterion  for  sufficiency  that  T  -  HT)  -> 

Z^,ZT^tX,Yij  is  sufficient  for  9.  where  y,., . yj,„^,...,yjc., . 

Denoting  by  jr  the  observed  value  of  K,  it  is  easy  to  show  that  the  conditional  distribution  of  Y, 


Fif.  1.  Simple  two-layer  badi-propatation  neural  network  for  loiistic  retreision. 
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given  7  «■  7(jp),  is  uniform  over  the  region  defined  by  D,  « {z:  T(z)  *■  7(  jr)}.  To  implement  the 
method  of  Section  1,  independent  and  identically  distributed  observations 
each  uniformly  distributed  over  X),.  are  generated  via  a  simulation,  and  the  back>propagation 
network  is  retrained  on  each  of  n  “new**  exemplars 

/-I 

yielding  the  set  of  back-propagation  estimates  of  d.  The  improved  MSE 

estimator  of  $  is  then 


5.  Numerical  example  of  a  computer-assisted  improvement  in  mean  squared  error 

The  following  numerical  example  is  intended  to  illustrate  the  many  different  concepts 
presented  in  this  paper  in  connection  with  algorithm  (1).  Assume  the  logistic  regression  model 
of  the  previous  section  with  p  «  1  (one  independent  variable),  and 

In  order  to  obtain  closed-form  solutions  so  that  comparisons  can  be  made  and  for  ease  of 
exposition,  assume  that  instead  of  minimizing  £fLi(p,  -  F(6X())^,  B  is  estimated  by  minimizing 
Bericson’s  [3]  logit  defined  by 

xl(B)  -  Z  -Pi){9X,  -  logit(  A))^ 

«.i  „  _  — _ 

where  logit(p)«ln(p/(l  -p)),  p€(0, 1).  To  avoid  singularities,  p,  will  be  taken  to  be  the 
modified  estimator 

-  _  Jzl _ 

'  Wi  +  Ti+Tj’  _  — _ 

for  fixed  r,,  tj  >  0,  which  corresponds  to  the  Bayes  estimator  of  p,  *■  FiBXj)  using  squared 
error  loss  and  a  betaCr,,  Tj)  prior  distribution.  It  is  easily  shown  by  differentiation  that 
has  a  unique  minimum  at  the  value  of  $  given  by 

£  A,m<p,(l  -  A)  •ogit(A) 

i-l 

Now,  suppose  that  m, -mj  —  S,  t, -Tj-O.OI,  - 1  and  ^2  “2.  To  simplify 

notation,  define  Ti-Ei-iTij  and  y2"Lf_|y2.i>  s®  that  p, -(7, +0.00/5.02,  P2*(y2  + 
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0.01)/5.02,  the  sufficient  statistic  for  ^  becomes  T  ■  r(y,,  y,)  -  y|  +  lYi  and  the  minimum 
logit  estimator  of  9  is 


g-d(y,.y,) 


#,(1  -p,)  logit(p.)  -p,)  logil(P:) 

Pi(J  “Pi)  +  4P2(1 -Pr) 


Note  that  y,  and  Y2  are  independent  random  variables,  with  Yi  binomiaKS,  F(di)),  i  ->  1,  2. 
In  practice,  $  is  unknown,  but  suppose  that  its  true  value  is  0  <■  1.  In  this  example,  the  expected 
value  of  S  is  easily  computed  by 


5!5! 


£(«■)-  E  »(<'■  1  .■■/(»)<' -mr‘F(2»)'ii-n2>)r'. 

ijmo  **v^  •/•y*w  y*; 

A 

and  the  mean  squared  error  of  6  is 


MSE(0) 


i.J~0 


5!5! 

i!(5-i)!y!(5-y!) 


xf(0)'(l  -P(0))’"'F(20y(l  -F(20))’"\ 


from  which 

£(0)->  0,81296  and  MSE(0)  -  1.15163,  for  0-1. 

Thus,  the  estimator  of  0  is  biased.  Denoting  5, » {(yj,  yj):  yi  +  lyi  ■  t.  0  <y„  y2  <  5),  the 
oondtttonal  distribution  of  (y|,  Y2)  given  T»Yi  +  2Y2  *  t,  is 


-I 


for  (y|,  yjIeS,,  r€{0, 1,...,1S},  and  is  0  elsewhere.  (Note  that  this  distribution  is  not 
“uniform",  since  in  this  example,  the  observables  were  reduced  initially  by  sufficiency  to  y,  and 
yj,  whereas  the  variables  Y^  in  Section  3  were  not)  The  improved  estinwtor  of  0  is 
dCri-ECdlD,  which,  in  this  example,  can  be  computed  fairly  easily  for  any  given  /e 
{0, 1,...,15)  by  the  formula 


The  unconditional  mean  squared  error  of  8  is 

15  j 

MSE(8)-  £(£(0|r-/)-0)  F{r-r), 

r-O 

where  P{T^t)  is  computed  from  the  joint  unconditional  binomial  distribution  of  (Ti,  y2). 
Hence,  from  (2), 

£(Var(0 1  r))  -  MSE(0)  -  MSE(«). 
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Table  1 

Summanr  of  computations 


1 

S,  •«yi.  .V:):  yi  +2y; « l.0<  y,.  y,  <5) 

E{i\T~n  _ 

0 

(0,0) 

3-10-“ 

-3.60000 

1 

(1.0) 

4.6- 10-' 

-1.09429 

2 

(0.1).  (2,0) 

3.810-‘ 

-0.2.?<y44 

3 

(1.1).  (3,0) 

2.4- 10-’ 

-0.45605 

4 

(0.2).  (2.1).  (4.0) 

0.00012 

-0.21953 

5 

(1.2).(3.1).(S.O) 

0.00051 

-0.26307 

6 

(0,3),  (2. 2).  (4.1) 

0.00184 

-0.037% 

7 

(1.3X(3.2),(5.1) 

0.00576 

-0.06173 

8 

(0.4),  (2. 3).  (4. 2) 

0.01564 

0.061 73 

9 

(1.4X(3.3).(S.2) 

0.03704 

0.037% 

10 

(0.5).{2.4X(4.3) 

0.07533 

0.26307 

11 

(1.5),  (3. 4).  (5. 3) 

0.13179 

0.21953 

12 

(2. 5).  (4. 4) 

0.19290 

0.45605 

13 

(3.5X(S.4) 

0.22472 

0.23644 

14 

(4.S) 

0.20362 

1.09429 

IS 

(5.5) 

0.11070 

3.60000 

Notr.  The  values  of  E{t\T ■■  t)  are  independent  of  the  true  value  of  9,  while  the  values  of  P{T r)  are 
computed  asruniat  0*1. 


When  #  « 1,  these  computations  yield 

MSE(«)  -  l.n699  and  £(Var(9|r)) -0.03464. 

Table  1  lists  the  possible  values  of  T,  along  with  the  sets  5,,  values  of  P(r  - 1)  (assuming  0-1) 
and  the  value  of  the  estimator  8(T)  at  T-  r. 

In  practice,  the  data  will  be  collected,  yielding  an  observed  value  for  T,  say  7  -  f  -  11,  but 
typically  it  will  happen  that  the  computation  of  5(f)  is  intractable.  Algorithm_(l)  would  then 
allow  one  to  approximate  the  estimate  5(f)  as  follows.  Take  n  -  1000,  for  example. 

(0  Generate  (y,*(f),  y2*(f)),  i  -  1,...,1000,  independent  and  identically  distributed  from 
the  distribution 

(ii)  Cdmpute 

I  lOOO 

**-io6o 

Canying  out  this  simulation  algorithm  yielded  a  value  of  5*  -  0.221S,  which  is  very  close  to 
the  exact  value  of  5(11)  -  0.21933. 
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By  formula  (2),  for  this  example,  this  procedure  yields  an  estimator  S*  with  unconditional 
mean  squared  error 

MSE(6*)  -  MSE(«)  +  0.001  £(Var(ei7'))  -  1.11699  +  (0.001  )(0.034 64)  -  1.11702. 

Hence,  while  the  computer^assisted  estimator  of  6,  namely  8*,  has  the  same  unconditional 
mean  as  the  original  estimator  S  (i.e.,  £(£*) »  £(9) »  0.81296),  its  mean  squared  error  ateut 
the  true  value  of  0-1  has  been  improved  (MSE(8*) - 1.11702  compared  to  MSE(0)- 
1.14163). 

Clearly,  by  the  deOnition  of  the  estimate  8(11),  this  algorithm  yields  an  approximation  to 
8(11).  Of  course,  in  this  example,  8(11)  is  easily  computed  and  the  algorithm  is  not  necessary. 
In  other  situations,  however,  the  generation  of  random  samples  from  the  conditional  distribu* 
tion  of  the  observable  given  the  sufficient  statistic  will  be  easy  compared  to  the  direct 
computation  of  8(/). 


5.  Summary  and  concluaions 

A  computer*assisted  Monte  Carlo  method  for  improving  the  mean  squared  error  in  paramet* 
ric  estimation  has  been  presented.  The  method  assumes  that  it  is  possible  to  derive  a  nontrivial 
sufficient  statistic  for  the  unknown  parameteKs),  and  that  is  is  relatively  straightforward  to 
simulate  the  conditional  distribution  of  the  observables  given  the  sufficient  statistic.  In 
addition,  a  central  limit  theorem,  similar  to  the  central  limit  theorem  for  the  bootstrap  mean, 
has  been  proven,  which  demonstrates  that  for  an  important  class  of  problems  the  Monte  Carlo 
distribution  of  ffie  computer<assisted  estimator  (asymptotically)  approximates  the  sampling 
distribution  of  the  **ideal”  esliniator  for  the  problem  (i.e.,  the  estimator  that  the  computer>as> 
sisted  estimator  is  aiming  to  approximate).  The  application  of  the  method  to  the  improved 
estimation  of  the  coefficients  of  a  logistic  regression  function  that  has  been  implemented  on  a 
simple  back’propagation  neural  network  has  been  illustrated.  Finally,  a  numerical  example 
related  to  the  logistic  regression  function  has  demonstrated  the  mean  squared  error  improve¬ 
ment  as  well  as  the  concepts  behind  and  implementation  of  the  algorithm. 

The  idea  of  computer-assisted  statistical  analysis  is  not  new.  An  idea  similar  in  spirit  to  that 
ooasidered,in  this  paper  is  the  technique  bootstrapping  (5],  which  has  revolutionized  the  *• 
practice  of  applied  statistics.  In  addition,  a  carefiil  consideration  of  the  theory  behind  the 
algorithm  presented  in  this  paper  shows  that  the  algorithm  is  just  a  simple  Monte  Carlo 
technique  for  evaluating  an  integral,  that  is,  the  (possibly  vector-valued)  integral  representing 
the  Ertt  moments)  of  the  conditional  distribution  of  the  basic  estimator  given  the  sufficient 
statistk.  Therefore,  the  method  could  be  improved  further  by  implementing  improved  Monte 
Carlo  integration  techniques,  an  area  that  is  currently  being  accelerated  by  the  work  of 
Wozniakowdd  [14]  and  others.  In  this  direction,  more  information  would  be  needed  concerning 
the  conditional  distributions  than  simply  an  ability  to  generate  random  samples  from  them. 
Also,  the  current  improved  integration  tedmiques  would  address  only  integrals  involving 
distributions  having  densities  with  respect  to  Zjeb^gue  measure,  whereas  the  simple  algorithm 
(1)  works  in  general.  Whether  the  additional  improvement  in  mean  squared  error  (if  any) 
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achievable  by  improved  Monte  Carlo  integration  is  worth  the  cost  of  the  added  computational 
complexity  for  certain  estimation  problems  is  an  area  for  further  research. 
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